reticulate::use_python("/anaconda3/bin/python")

library(readr)
library(ggplot2)
library(dplyr)
library(methods)
library(keras)

theme_set(theme_minimal())

MNIST

The National Institute of Standards and Technology (NIST) is a US federal agency specifically tasked with developing measurement standards. The agency has put together several datasets of images showing handwritten digits. These were standardized and put together by Yann LeCun and Corinna Cortes while at AT&T Labs into a classification dataset. The prediction task is to recognize what number is shown in the image.

The dataset, known as MNIST or MNIST-10, is a very commonly used training set when learning how to do image recognition. In academic research, it is certainly overused, but it does make a great example for teaching purposes. The images are relatively small but still represent an interesting classification task. The primary reason for this is that digit classification does not require color or high-resolution images, making the data size relatively small.

As with the class images and the flowers dataset, the MNIST data is split into metadata and pixel counts. We will read both of these into R here:

mnist <- read_csv("https://statsmaths.github.io/ml_data/mnist_10.csv")
x28 <- read_rds("image_data/mnist_10_x28.rds")

A link to download the dataset is given in today’s lab. These images are in black and white. Let’s look at the dimension of the data to make sure that it makes sense to us:

dim(x28)
## [1] 60000    28    28     1

There are 60000 total images, with 28 rows, 28 columns, and one color channel. The one color channel exists because the images are in black and white. Before diving into the classification task, we can take a quick look at what the digits actually look like. Here are 60 examples of the digit 2

par(mar = c(0,0,0,0))
par(mfrow = c(6, 10))
for (i in sample(which(mnist$class == 2), 60)) {
  plot(0,0,xlim=c(0,1),ylim=c(0,1),axes= FALSE,type = "n")
  rasterImage(x28[i,,,],0,0,1,1)
}

We can quickly recognize that all of these are the digit 2, though note that the specific pixel values between two different 2’s can be extremely different.

Neural networks: Classification and one-hot encoding

It is now time to return to neural networks. However, how do we make the neural network deal with a categorical output? The trick for doing multiclass regression with neural networks is to realize that we can easily make the last layer of a neural network output multiple values. All we need to do is have the function that we want to minimize by a function of all these outputs rather than a single output with each row of the data. An easy way to do this is to assign multiple response variables with each observation and do mean squared error loss, but we will now take the mean squared loss of predicting all the outputs.

The keras package include a function to_categorical that converts numeric class labels to binary indicator variables. These are called a one-hot encoding. This is very similar to the model matricies we built for the X matrix when using categorical predictors. Here is a simple example of its application:

to_categorical(c(1,1,2,4,10))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,]    0    1    0    0    0    0    0    0    0     0     0
## [2,]    0    1    0    0    0    0    0    0    0     0     0
## [3,]    0    0    1    0    0    0    0    0    0     0     0
## [4,]    0    0    0    0    1    0    0    0    0     0     0
## [5,]    0    0    0    0    0    0    0    0    0     0     1

Notice that keras is 0-indexed, so it wants the first category to be zero. That’s why we have 11 columns even thought the largest category is ten. The best way to understand how to use this response matrix is to see a worked example in keras.

Construct training data

Let’s now apply this to the MNIST dataset. We first flatten the dataset x28 and then construct a training set of data. This is exactly the same as we did last time with the flowers image dataset.

X <- t(apply(x28, 1, cbind))
y <- mnist$class

X_train <- X[mnist$train_id == "train",]
y_train <- to_categorical(y[mnist$train_id == "train"], num_classes = 10)

Notice that here the smallest category is already “0”, so we can put the response y in directly to the to_categorical function.

Building the neural network

Next, we construct the model architecture for the neural network. Here we have 28^2 inputs, one hidden layer with 128 neurons, and an output layer with 10 neurons. A rectified linear unit is used to “activate” the hidden layer and a “softmax” function is used to turn the outputs into probabilities.

model <- keras_model_sequential()
model %>%
  layer_dense(units = 128, input_shape = c(28^2)) %>%
  layer_activation(activation = "relu") %>%
  layer_dense(units = 10) %>%
  layer_activation(activation = "softmax")

model %>% compile(loss = 'categorical_crossentropy',
                  optimizer = optimizer_rmsprop(),
                  metrics = c('accuracy'))

model
## Model
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## dense_1 (Dense)                  (None, 128)                   100480      
## ___________________________________________________________________________
## activation_1 (Activation)        (None, 128)                   0           
## ___________________________________________________________________________
## dense_2 (Dense)                  (None, 10)                    1290        
## ___________________________________________________________________________
## activation_2 (Activation)        (None, 10)                    0           
## ===========================================================================
## Total params: 101,770
## Trainable params: 101,770
## Non-trainable params: 0
## ___________________________________________________________________________

This is a small model, but notice that it already has over 100k parameters to train! If you prefer mathematical notation, here is how we can describe this neural network:

\[ \widehat{Y} = \text{softmax} \left(a_2 + \sigma(a_1 + X \cdot B_1) \cdot B_2 \right) \]

Where the B’s are matrices of weights that need to be learned, the a’s are vectors of biases that need to be learned, and sigma is the rectified linear unit (ReLU). The ReLU turns negative values into zero and operates component wise on the matrix.

Next, we compile the model (notice that this is different because we are doing classification):

model %>% compile(loss = 'categorical_crossentropy',
                  optimizer = optimizer_sgd(lr = 0.01, momentum = 0.9,
                                            nesterov = TRUE),
                  metrics = c('accuracy'))

And fit it on our training data (I taken 20% of the “training” set and re-assigned it to a secondary validation set):

history <- model %>%
  fit(X_train, y_train, epochs = 10, validation_split = 0.2)
plot(history)

The model seems to do very well will a minimal amount of effort on our part. Here is how it performs on the real validation set:

y_pred <- predict_classes(model, X)
tapply(y == y_pred, mnist$train_id, mean)
##      test     train     valid 
##        NA 0.9868667 0.9646000

Over 96%, which is quite good given that we have just a single hidden layer neural network with a minimal set of tweaks to the inputs.

Which categories are the hardest to distinguish?

table(y[mnist$train_id == "train"], y_pred[mnist$train_id == "train"])
##    
##        0    1    2    3    4    5    6    7    8    9
##   0 2980    0    1    1    2    6    2    2    5    1
##   1    0 2971    8    4    5    0    3    5    3    1
##   2    4    1 2965    4    1    1    4    8   10    2
##   3    1    2    8 2933    1   27    1    8   10    9
##   4    0    2    3    0 2955    0    4    6    0   30
##   5    0    0    1    5    0 2976    9    1    4    4
##   6    4    4    1    1    0   10 2979    0    1    0
##   7    0    9   11    3    4    2    0 2953    2   16
##   8    2    8    4    9    2   18    6    0 2945    6
##   9    4    3    1    6    8    9    0   13    7 2949

It seems the we most often confuse 9’s and 4’s; 8’s and 3’s, and 7’s and 9’s. Thinking about the shape of these digits all of these confusions should seem reasonable.

Visualize weights

[Note: Understand this section, but keep in mind that you do not need to run this code every time you build a neural network.]

There are many ways of accessing and modifying the weights inside of a training keras model. The get_layer function returns a pointer to a particular layer in a neural network (it lives in Python, so the indices start at 0). There are two sets of “weights” in each layer: the weights corresponding to the matrix “B” and the bias (called “a” in our equation above).

layer <- get_layer(model, index = 1)
dim(layer$get_weights()[[1]])
## [1] 784 128
dim(layer$get_weights()[[2]])
## [1] 128

Here, we grabbed the first Dense layer, which has 28^2 by 128 weights. It can sometimes be insightful to visualize these weights. Let’s create a new, much smaller model. This one does not even have a hidden layer. It is essentially a multinomial model, albeit with a different algorithm for learning the weights.

model <- keras_model_sequential()
model %>%
  layer_dense(units = 10, input_shape = c(28^2)) %>%
  layer_activation(activation = "softmax")

model %>% compile(loss = 'categorical_crossentropy',
                  optimizer = optimizer_rmsprop(),
                  metrics = c('accuracy'))

history <- model %>%
  fit(X_train, y_train, epochs = 10, validation_split = 0.2)
plot(history)

It is still a fairly good model, relative to its simplicity:

y_pred <- predict_classes(model, X)
tapply(y == y_pred, mnist$train_id, mean)
##      test     train     valid 
##        NA 0.9278333 0.9168667

With a validation classification of about 91.5%. Lets grab the weights for the one layer with trainable parameters and turn this into an array

layer <- get_layer(model, index = 1)
B <- array(layer$get_weights()[[1]], dim = c(28, 28, 10))

The code to visualize these weights is a bit complex. Focus on the output rather than this code.

relu <- function(mat) {
  id <- which(mat <= 0)
  mat[id] <- 0
  mat
}

par(mar = rep(0, 4L))
par(mfrow = c(3, 4))
for (j in 1:10) {
  v <- B[,,j]
  b <- array(1, dim = c(28, 28, 4))
  b[,,1] <- 1 - relu(v / max(abs(v)))
  b[,,2] <- 1 - relu(v / max(abs(v)) * -1)
  b[,,3] <- 1 - relu(v / max(abs(v)))
  b[,,4] <- ifelse(v == 0, 0, 1)
  plot(0,0,xlim=c(0,1),ylim=c(0,1),axes= FALSE,type = "n")
  rasterImage(b,0,0,1,1)
  box()
}

Green shows positive weights and purple shows negative ones. Remember that these 10 images relate to the ten digits. They show which pixels need to be activated (“white” in our formulation). Notice that images in this neural network are largely determine based on where the image is not colored. You can almost see the outlines of the digits in the images if you squint at these.

Another visualization technique that we can do with neural networks is to find the inputs with the highest probabilities. Here are the most typical images for each class:

y_pred <- predict_proba(model, X)
id <- apply(y_pred, 2, which.max)
par(mar = c(0,0,0,0))
par(mfrow = c(3, 4))
for (i in id) {
  plot(0,0,xlim=c(0,1),ylim=c(0,1),axes= FALSE,type = "n")
  rasterImage(x28[i,,,],0,0,1,1)
}

Compared to the rest of the corpus, these are all relatively thick but small.

Flowers with Neural Networks

Just for comparison, let’s try to use neural networks to fit the flowers dataset.

flowers <- read_csv("~/gh/ml_data_full/flowers_17.csv")
x64 <- read_rds("image_data/flowers_17_x64.rds")

X <- t(apply(x64, 1, cbind))
y <- flowers$class

X_train <- X[flowers$train_id == "train",]
y_train <- to_categorical(y[flowers$train_id == "train"], num_classes = 17)

The model we use with have three hidden layers of 128 parameters each.

model <- keras_model_sequential()
model %>%
  layer_dense(units = 128, input_shape = c(ncol(X_train))) %>%
  layer_activation(activation = "relu") %>%
  layer_dense(units = 128) %>%
  layer_activation(activation = "relu") %>%
  layer_dense(units = 128) %>%
  layer_activation(activation = "relu") %>%
  layer_dense(units = 17) %>%
  layer_activation(activation = "softmax")

model %>% compile(loss = 'categorical_crossentropy',
                  optimizer = optimizer_sgd(lr = 0.003, momentum = 0.9,
                                            nesterov = TRUE),
                  metrics = c('accuracy'))

model
## Model
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## dense_4 (Dense)                  (None, 128)                   1572992     
## ___________________________________________________________________________
## activation_4 (Activation)        (None, 128)                   0           
## ___________________________________________________________________________
## dense_5 (Dense)                  (None, 128)                   16512       
## ___________________________________________________________________________
## activation_5 (Activation)        (None, 128)                   0           
## ___________________________________________________________________________
## dense_6 (Dense)                  (None, 128)                   16512       
## ___________________________________________________________________________
## activation_6 (Activation)        (None, 128)                   0           
## ___________________________________________________________________________
## dense_7 (Dense)                  (None, 17)                    2193        
## ___________________________________________________________________________
## activation_7 (Activation)        (None, 17)                    0           
## ===========================================================================
## Total params: 1,608,209
## Trainable params: 1,608,209
## Non-trainable params: 0
## ___________________________________________________________________________

Now, we can try a model on this data. There are far fewer data points here, so I will use more iterations.

history <- model %>%
  fit(X_train, y_train, epochs = 100, validation_split = 0.2)
plot(history)

The model seems to plateau around a classification rate just over 50%. It does not perform even as well on the true validation set:

y_pred <- predict_classes(model, X)
tapply(y == y_pred, flowers$train_id, mean)
##      test     train     valid 
## 0.3000000 0.9102941 0.4705882

Why does it not do as well as perhaps your non-NN models were able to do? The issue is that we were constructing features that used the shape of the image and the relationship between the color channels. Next time we will see how to do this in the context of neural networks.